Optimize 1-bit estimator tail path #49

Open
Flitieter wants to merge 1 commit into VectorDB-NTU:main from Flitieter:pr/include-optimized-estimator-cleanup

@Flitieter

This PR introduces a tail-padded AVX-512 path. Full 512-dimensional blocks are processed with 512-bit SIMD loads and popcount, while the final partial block is stored compactly and handled with masked AVX-512 loads. This keeps the tail vectorized instead of falling back to scalar-style processing.
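
To illustrate the masked-load idea described above, here is a minimal sketch, not the code in this PR. The function name `and_popcount_compact`, the AND-plus-popcount kernel, and the restriction to dimensions that are multiples of 64 are all assumptions made for the example. When the compiler targets AVX-512F and AVX-512 VPOPCNTDQ, the tail is read with a zero-masking load; otherwise a portable fallback computes the same value.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__AVX512F__) && defined(__AVX512VPOPCNTDQ__)
#include <immintrin.h>
#endif

// Hypothetical helper (not from the PR): popcount(code AND query) over `dim`
// bits. Both bit strings are stored compactly as dim / 64 64-bit words, so the
// buffers end right after the last valid word.
inline std::uint64_t and_popcount_compact(const std::uint64_t* code,
                                          const std::uint64_t* query,
                                          std::size_t dim) {
    assert(dim % 64 == 0);  // assumption of this sketch
    const std::size_t words = dim / 64;
    const std::size_t full_blocks = words / 8;  // 8 words = 512 bits
    const std::size_t tail_words = words % 8;
#if defined(__AVX512F__) && defined(__AVX512VPOPCNTDQ__)
    __m512i acc = _mm512_setzero_si512();
    for (std::size_t b = 0; b < full_blocks; ++b) {
        const __m512i c = _mm512_loadu_si512(code + 8 * b);
        const __m512i q = _mm512_loadu_si512(query + 8 * b);
        acc = _mm512_add_epi64(acc, _mm512_popcnt_epi64(_mm512_and_si512(c, q)));
    }
    if (tail_words != 0) {
        // One mask bit per valid 64-bit lane. Masked-off lanes are zero-filled
        // and do not fault, so the load stays inside the compact buffer.
        const __mmask8 k = static_cast<__mmask8>((1u << tail_words) - 1u);
        const __m512i c = _mm512_maskz_loadu_epi64(k, code + 8 * full_blocks);
        const __m512i q = _mm512_maskz_loadu_epi64(k, query + 8 * full_blocks);
        acc = _mm512_add_epi64(acc, _mm512_popcnt_epi64(_mm512_and_si512(c, q)));
    }
    return static_cast<std::uint64_t>(_mm512_reduce_add_epi64(acc));
#else
    // Portable fallback with the same result, used when AVX-512 is not enabled.
    (void)full_blocks;
    (void)tail_words;
    std::uint64_t total = 0;
    for (std::size_t i = 0; i < words; ++i) {
        total += std::bitset<64>(code[i] & query[i]).count();
    }
    return total;
#endif
}
```

For dim = 768, for example, the sketch runs one full 512-bit iteration and then a single masked iteration with mask `0x0F` covering the remaining 256 bits.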

@Flitieter

Tail-pad report

1. What problem does tail-pad address?

The 1-bit estimator is built around an AVX-512 path that processes 512 dimensions per block. When the input dimension is not a multiple of 512, the remaining partial block has to be handled separately. In the original path this tail is processed outside the 512-bit vector path, which lowers SIMD utilization and adds overhead on the final block. As a result, the estimator becomes noticeably slower when the tail is large: in the benchmark in section 3, the original path takes about 26.0 ns at dim = 448, longer than the 25.4 ns it needs at dim = 1536.

2. Current tail-pad method

The current tail-pad method directly pads the remaining dimensions to a full 512-dimensional block, so the tail can stay on the same AVX-512 path as the main body.

In other words:

  • full 512-dim blocks use the regular AVX-512 estimator path,
  • the final partial block is padded to 512 dims,
  • the estimator can then continue using 512-bit vector loads, bitwise operations, popcount, and reduction without falling back to a less efficient tail path.

This keeps the implementation simple and improves the utilization of the AVX-512 path on non-512-aligned dimensions.
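
The padding step can be sketched as follows. This is an illustration under stated assumptions rather than the PR's implementation: the helper names (`padded_dim`, `pad_code_to_512`, `and_popcount_blocks`) are hypothetical, the pad value is assumed to be zero, and the inner loop over 8 words stands in for one 512-bit load, AND, and popcount. With zero padding, the extra bits contribute nothing to an AND-plus-popcount result, so the padded tail can go through the same block loop as the main body.

```cpp
#include <algorithm>
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kBlockBits = 512;
constexpr std::size_t kWordsPerBlock = kBlockBits / 64;

// Round `dim` up to the next multiple of 512.
inline std::size_t padded_dim(std::size_t dim) {
    return (dim + kBlockBits - 1) / kBlockBits * kBlockBits;
}

// Copy a compact bit string (dim / 64 words) into a buffer whose length is a
// whole number of 512-bit blocks. Padding words are zero (an assumption).
inline std::vector<std::uint64_t> pad_code_to_512(
        const std::vector<std::uint64_t>& compact, std::size_t dim) {
    assert(dim % 64 == 0 && compact.size() == dim / 64);
    std::vector<std::uint64_t> padded(padded_dim(dim) / 64, 0);
    std::copy(compact.begin(), compact.end(), padded.begin());
    return padded;
}

// Block-wise popcount(a AND b). Every block is a full 512 bits, so there is
// no separate tail branch.
inline std::uint64_t and_popcount_blocks(const std::vector<std::uint64_t>& a,
                                         const std::vector<std::uint64_t>& b) {
    assert(a.size() == b.size() && a.size() % kWordsPerBlock == 0);
    std::uint64_t total = 0;
    for (std::size_t blk = 0; blk < a.size(); blk += kWordsPerBlock) {
        for (std::size_t w = 0; w < kWordsPerBlock; ++w) {
            total += std::bitset<64>(a[blk + w] & b[blk + w]).count();
        }
    }
    return total;
}
```

The trade-off in this variant is memory: a 768-dim code occupies 1024 bits after padding, whereas a compact layout with masked loads keeps it at 768 bits.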

3. Single-function results

The following table reports median per-call latency for the single-function benchmark (warmup_ip_x0_q_512) on the OpenAI-1536 dimension sweep.

| dim | remaining_dim | origin (ns) | 512-bit tail pad (ns) | latency reduction |
|---|---|---|---|---|
| 256 | 256 | 17.5343 | 8.7724 | 50.0% |
| 320 | 320 | 20.1750 | 8.7747 | 56.5% |
| 384 | 384 | 23.1806 | 9.7438 | 58.0% |
| 448 | 448 | 25.9888 | 9.1661 | 64.7% |
| 512 | 0 | 10.1682 | 9.6581 | 5.0% |
| 640 | 128 | 14.7538 | 12.5987 | 14.6% |
| 768 | 256 | 20.1823 | 13.1654 | 34.8% |
| 960 | 448 | 29.5875 | 13.5227 | 54.3% |
| 1024 | 0 | 19.5496 | 18.9976 | 2.8% |
| 1280 | 256 | 26.5937 | 18.3340 | 31.1% |
| 1536 | 0 | 25.3903 | 24.8515 | 2.1% |

Overall, the gain is small when the dimension is already 512-aligned (2.1% to 5.0%), but becomes significant when the remaining tail is large. The largest improvement, 64.7%, appears at dim = 448, and among dimensions above 512 the largest is 54.3% at dim = 960, also with remaining_dim = 448. This is where the tail overhead is most visible in the original path.
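
The percentages in the last column of the table are consistent with a relative latency reduction, (origin - new) / origin, rather than a throughput ratio. The sketch below shows both ways of expressing the same measurement; the function names are hypothetical and only serve the illustration.

```cpp
#include <cassert>

// Relative latency reduction in percent: (origin - new) / origin * 100.
inline double latency_reduction_pct(double origin_ns, double new_ns) {
    assert(origin_ns > 0.0);
    return (origin_ns - new_ns) / origin_ns * 100.0;
}

// The same comparison expressed as a ratio: origin / new.
inline double speedup_ratio(double origin_ns, double new_ns) {
    assert(new_ns > 0.0);
    return origin_ns / new_ns;
}
```

Applied to the dim = 448 row (25.9888 ns and 9.1661 ns), `latency_reduction_pct` reproduces the reported 64.7%, which corresponds to a ratio of roughly 2.8x from `speedup_ratio`.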
